Abstract. This paper discusses an approach to representing and reasoning about 
constraints over strings. We discuss how many string domains can often be con- 
cisely represented using regular languages, and how constraints over strings, and 
domain operations on sets of strings, can be carried out using this representation. 


1 Introduction 

Constraint satisfaction problems (CSPs) involve finding values for variables subject to 
constraints that permit or exclude certain combinations of values. Since many tasks in 
computer science [12,5,24] and many real-world problems [25,13,17,22] can be formulated
as CSPs, they have attracted widespread research and commercial interest
over the last two decades. Whereas much work has been done on constraints over finite
discrete domains and numerical intervals, constraint reasoning over strings remains
largely unexplored.

Strings appear everywhere. Like any other objects in the real world, certain relationships
exist among strings and between strings and other objects. In many real-world applications
those relationships can be formalized as constraints over strings. For example,
we are applying constraint-based planning to automate certain operations in software 
domains [8,9], domains in which the actions are operations in a software environment, 
such as moving files, searching for information on the internet or image processing. One 
characteristic of nearly all software domains is the ubiquity of strings and constraints. 

File path names, URLs and the contents of text files and web pages are all represented 
as text, which often obeys specific constraints. For instance, many programs have inputs
or outputs in the form of files, whose names follow some canonical form: 

- A Java compiler expects the pathname for the source code of class “my.package.MyClass”
to be “my/package/MyClass.java,” and it produces a file “my/package/MyClass.class.”

- The pathname of data downlinked from a spacecraft or planetary rover is often in
a form like “phase2/sol29/my_instrument/seq0002.jpg,” where each component of
the pathname refers to some meaningful aspect of the data.

- The contents of structured or semistructured text files can be described in terms of 
constraints between the text and what the text represents. 

A distinguishing characteristic of software domains and others involving strings is that
the set of strings corresponding to a variable representing a given name, input or file is
either infinite or so large that listing them all would require unacceptable amounts of
time and storage. The challenge of effectively representing and reasoning about
constraints on strings is to represent infinite string sets without actually requiring infinite
space and to deal with constraints over infinite string sets without exhaustively listing
infinite string values. In this paper, we provide such a string representation, based on
regular languages; we discuss how common string constraints are defined and handled
using this representation; and we show how string constraint problems can be solved
within the general-purpose constraint reasoning framework we have developed for an
ongoing constraint-based planning project.

The remainder of the paper is organized as follows. In Section 2, we review the notation
of constraint satisfaction problems. In Section 3, we discuss string domain
representations, namely, as regular languages. In Section 4, we provide definitions of the
constraints on strings and describe how they are enforced using this domain representation.
In Section 5, we discuss how standard domain operations, such as intersection and
determining equality or cardinality, are handled. In Section 6, we analyze the
computational complexity of all the operations involved in constraint reasoning using regular
domains. In Section 7, we show how the string constraints can be applied to solving
some interesting problems. Finally, in Section 8, we conclude by summarizing our
contribution.


2 Constraint Satisfaction Problems 

A Constraint Satisfaction Problem (CSP) is a representation and reasoning framework
consisting of variables, domains, and constraints. Formally, it can be defined
as a triple $\langle X, D, C \rangle$ where $X = \{x_1, x_2, \ldots, x_n\}$ is a finite set of variables,
$D = \{d(x_1), d(x_2), \ldots, d(x_n)\}$ is a set of domains containing values the variables may take,
and $C = \{C_1, C_2, \ldots, C_m\}$ is a set of constraints. Each constraint $C_i$ is defined as a
relation $R$ on a subset of variables $V = \{x_i, x_j, \ldots, x_k\}$, called the constraint scope. $R$ may be
represented extensionally as a subset of the Cartesian product $d(x_i) \times d(x_j) \times \ldots \times d(x_k)$.
A constraint $C_i = (V_i, R_i)$ limits the values the variables in $V_i$ can take simultaneously
to those assignments that satisfy $R_i$. Let $V_k = \{x_{k_1}, \ldots, x_{k_l}\}$ be a subset of $X$. An $l$-tuple
$(x_{k_1}, \ldots, x_{k_l})$ from $d(x_{k_1}) \times \ldots \times d(x_{k_l})$ is called an instantiation of variables in $V_k$. An
instantiation is said to be consistent if it satisfies all the constraints restricted in $V_k$. A
consistent instantiation of all variables in $X$ is a solution. The central reasoning task (or
the task of solving a CSP) is to find one or more solutions.

A CSP can be solved by search using, e.g., a standard backtracking algorithm [4,10].
However, for CSPs with infinite domains such as the ones we are interested in in this paper,
it is not guaranteed that a solution can be found by search alone, because it is infeasible
to enumerate all values of infinite variable domains. Instead, CSPs with infinite
domains need to be relaxed by consistency enforcement before or during the search.
Enforcing local consistency eliminates inconsistent values from variable domains [16,3].
In theory, if a given CSP has only one solution, enforcing a certain level of consistency
will eventually make every variable domain a singleton domain; if the CSP has more
than one solution, or infinitely many solutions, every remaining value in the domain
after consistency enforcement will be part of a solution. In practice, an effective
constraint solving strategy enforces a certain level of consistency, such as generalized arc
consistency [18,19], at each node of the search tree. A key issue is the trade-off between
time spent on propagation and the reduction in search space needed to allow feasible
and efficient search. Based on our experience dealing with constraint-based planning in
software environments, much depends on how the variable domains are represented and
how the constraints are evaluated or executed to enforce consistency. In the next three
sections, we focus on string domain representation and a definition of constraints over
string domains. These string constraints are in the constraint library of the constraint
reasoning system we implemented and, together with other numerical and boolean
constraints, are used to model the planning problems.

3 String Domains 

The domain d{x) of variable x is the set of values that x can take. This set will, in 
general, change during the course of search and constraint propagation. Typically, a 
variable’s domain is represented as a list of the values that the variable can take. For 
numeric domains, we can instead represent a domain as an interval, yielding substantial 
decreases in space and time requirements and making it possible to represent an infinite 
set of values [11].

In the domains of interest, we frequently want to represent infinite, or very large, sets 
of strings, such as all possible pathnames matching a given pattern. Representing this 
set as a list is clearly infeasible, since it is infinite. Intervals are equally inappropriate. 
While it is possible to represent some sets of strings as intervals, such as all names 
between “Jones” and “Smith” in the phone book, such intervals are far less useful in
practice than numeric intervals.

However, there is an alternative representation of sets of strings that is far more 
useful, as evidenced by its ubiquity: regular languages. Regular languages are sets of 
strings that are accepted by regular expressions or finite automata, which are widely 
used in string matching, lexical analysis and many other applications. Although there 
are many languages that are not regular, such as palindromes, regular languages provide 
a nice tradeoff between expressiveness and tractability. 

As we will discuss, not only can we enforce generalized arc consistency (GAC) 
[3] for a wide range of useful string constraints when the domains are represented as 
regular languages, but we can also perform the domain operations necessary for constraint
propagation and search.

Regular languages are a much more flexible representation than intervals, in that the 
set of regular languages is closed under intersection, union and negation, whereas the 
set of intervals is only closed under intersection. 

We use two different representations of regular languages: regular expressions and 
finite automata. Regular expressions are used for input and are converted to FAs, which 
are used computationally. Since regular expressions and FAs are well known, we will 
not discuss them in depth, but we will briefly review for the sake of defining our termi- 
nology. 

A regular expression represents a regular language over an alphabet $\Sigma$. In our
implementation, $\Sigma$ is the set of Unicode characters. We use the following notation to describe
regular expressions.



Expression    Accepts

[abc]         one of the characters a, b, c
[a-c]         one of the characters in the range a-c
¬[abc]        any character in $\Sigma$ except a, b, c
.             any character in $\Sigma$
\c            the literal character c
re1 re2       re1 followed by re2
re1|re2       either re1 or re2
re*           zero or more repetitions of re
re+           one or more repetitions of re
re?           zero or one occurrences of re
(re)          re (used to override precedence)


The purpose of the notation \c is to “quote” symbols that would otherwise be interpreted
as syntax characters. For example, “\[” can be used to refer to the character “[” and “\\”
refers to the character “\”.

We represent regular languages internally using FAs, since the latter are easier to
compute with than regular expressions. An FA is a pair $\langle S, T \rangle$, where $S$ is a set of
states and $T$ is a set of labeled transitions between the states. Each transition in $T$ is a
triple $\langle n_1, l, n_2 \rangle$, which we will write $\langle n_1 \xrightarrow{l} n_2 \rangle$, where $n_1$ is the starting state of the
transition, $n_2$ is the ending state and $l \in \Sigma$ is the transition label. The input to the FA is a
sequence of symbols from $\Sigma$. Whenever there are symbols left to read, the FA reads the
next symbol and follows a transition from the current state whose label $l$ is the symbol
just read. If there are multiple transitions labeled $l$, one is chosen nondeterministically.
If there are no transitions labeled $l$, the FA halts and returns failure. For efficiency, we
allow transitions to have sets of labels, represented using the same notation as is used
for regular expressions. For example, we could have a transition $\langle n_1 \xrightarrow{[a-zA-Z]} n_2 \rangle$,
meaning the transition will be taken if the symbol is any character from the English
alphabet. This is logically equivalent to having a separate transition for each symbol.
For notational convenience, we also refer to transitions labeled with $\epsilon$. An $\epsilon$-transition
is always applicable and can be followed without reading any characters. An FA has a
single start state, which is always the first state, $S[0]$, and zero or more accept states.
To determine whether a string $s$ is in the language accepted by an FA $\langle S, T \rangle$, we
start the FA in $S[0]$ and have it read $s$ until there are no characters left to read. If, at that
time, the FA is in an accept state, then $s$ is in the language. Otherwise, it is not. In our
visual depiction of FAs, states, transitions, start states and accept states are represented
as follows:


[Figure: graphical notation for a state, a transition, a start state and an accept state]


A deterministic finite automaton (DFA) is an FA with no $\epsilon$-transitions and in
which there is only one transition out of every state for each label $l \in \Sigma$. An FA that does
not satisfy these conditions is a nondeterministic FA (NFA). In the remainder of the paper,
we will assume an FA is an NFA unless stated otherwise. As is well known, NFAs
and DFAs have equivalent expressive power, in that both accept the family of regular
languages. We call a domain represented by
a regular expression or FA a regular domain.
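
The acceptance procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in our system: the dictionary encoding of transitions, the use of None as the label of an $\epsilon$-transition, and the function names are all assumptions made for the example. Nondeterminism is handled by tracking the set of all states the FA could be in.

```python
def eps_closure(states, trans):
    """All states reachable from `states` using only epsilon (None) transitions."""
    seen, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in trans.get((s, None), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def accepts(start, accept_states, trans, text):
    """Simulate an NFA; trans maps (state, symbol) to a set of successor states."""
    current = eps_closure({start}, trans)
    for ch in text:
        nxt = set()
        for s in current:
            nxt |= trans.get((s, ch), set())
        current = eps_closure(nxt, trans)
        if not current:
            return False  # no applicable transition: the FA halts with failure
    return bool(current & accept_states)


# Hypothetical example FA for the language a*b? (state 0 is the start state).
TRANS = {(0, "a"): {0}, (0, None): {1}, (1, "b"): {2}}
ACCEPT = {1, 2}
print(accepts(0, ACCEPT, TRANS, "aab"))
```

Tracking a set of current states is equivalent to exploring every nondeterministic choice at once, which avoids backtracking over individual transitions.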

Regular expressions and FAs [15] have been used in many application domains
involving strings, such as data mining from databases or from the web for discovering
interesting data patterns and web structures. For example, in [6], the authors addressed
the issue of mining frequent sequences from a database of sequences in the presence
of regular expression constraints (see [1] for detailed discussion on the issue of mining
sequential patterns). Regular expression constraints are user-defined sequence patterns
that are used to match strings in the database or web during query or search. Our work
differs from past work in that we do not simply use regular languages to match fixed
strings. Rather, we use them to propagate constraints among string variables, whose
domains may be infinite. For example, match is indeed a common constraint in our library.
However, the string being matched need not be singleton. In addition to match, many
other types of string constraints appearing in real-world problems need to be represented.
We discuss some common ones in the next section.

4 Constraints 

Constraints are usually defined as mathematical formulations of relationships that must hold
between objects. For example, $x + y = z$ is a constraint describing an equality relation
that holds among three numeric variables x, y, and z. Similarly, for the string variables x,
y, and z, we can define a string constraint as $x + y = z$, which represents a concatenation
relation; that is, string z is the concatenation of strings x and y. We have implemented
a number of string constraints in our constraint reasoning framework, which supports
generalized arc consistency (GAC), even on infinite sets of strings. In the following, we
give definitions of these constraints, illustrated by how they are enforced using FAs.

4.1 Matches 

One of the constraints in the library tests whether a string matches a given regular 
expression: 

matches(string x, string re) 

Although matches takes two arguments, it is essentially a unary constraint, because it
is not enforced unless the domain of re is a singleton, in which case it computes the
FA corresponding to the regular expression represented by re and intersects it with the
domain of x. Matches subsumes all possible unary constraints over strings expressible
in our formalism, so other possible constraints, such as allUpperCase or isAlphaNumeric,
are not implemented. Matches is used in type constraints to define the initial domains
of variables of given subtypes of string. For example, we can define a Unix filename as
any string of non-zero length that does not contain the character “/”:


matches(fn, “¬[/]+”)



and we can define a time as a string of the form HH:MM:SS: 


matches(t, “(([0-1][0-9])|(2[0-3])):[0-5][0-9]:[0-5][0-9]”)
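
As a rough illustration of these two type constraints, the sketch below checks single candidate strings with Python's standard re module. It only covers the singleton case: it does not intersect FAs as our matches constraint does, and the names FILENAME, TIME and matches are chosen for the example. Python writes the negated class ¬[/] as [^/].

```python
import re

# Unix filename: one or more characters, none of which is "/".
FILENAME = re.compile(r"[^/]+")

# Time of the form HH:MM:SS, with hours restricted to 00-23.
TIME = re.compile(r"(([0-1][0-9])|(2[0-3])):[0-5][0-9]:[0-5][0-9]")


def matches(value, pattern):
    """True if the whole string `value` is in the language of `pattern`."""
    return pattern.fullmatch(value) is not None


print(matches("notes.txt", FILENAME), matches("24:00:00", TIME))
```

Using fullmatch rather than search matters here: the constraint requires the entire string to belong to the language, not just some substring of it.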

4.2 Concatenation



One of the most obvious operations on strings is concatenation. The concatenation
of two strings, x and y, yields another string, z, which consists of all the characters of x
followed by all the characters of y:

concat(z,x,y)

This can be generalized to concatenation of three or more strings in the obvious way.
If the domains of x and y are regular, the domain of z is simply the result of
concatenating the FA representations of x and y, that is, adding $\epsilon$-transitions from the
accept states of the FA for x to the start state of the FA for y, as shown in Figure 1.
This is a linear-time operation.

Less obviously, if the domains of x and z are regular, the domain of y is also regular.
To construct an FA for y given FAs for x and z, we in effect traverse the FAs for z and x
in parallel, exploring the cross-product of the nodes from the two FAs, starting with the
pair of initial states and adding a transition $\{s_n, t_m\} \xrightarrow{lab} \{s_p, t_q\}$ for every node $\{s_n, t_m\}$
and every label $lab$ such that the transitions $s_n \xrightarrow{lab} s_p$ and $t_m \xrightarrow{lab} t_q$ appear in the original
FAs (see Figure 2). This is simply the operation that is performed when intersecting
two FAs. Whenever we reach a node $\{r, t\}$ such that node $r$ is an accept state in the FA
for x, we mark node $t$. After the traversal is complete, the marked nodes in the FA for z
represent all of the states that can be reached by reading a string accepted by x.

A new nondeterministic FA (NFA) for y is constructed by copying the FA for z,
making the start node a non-start node and making all the marked nodes new start
nodes. The complexity of the whole operation is dominated by generating the
cross-product FA ($O(mn)$, where $m$ and $n$ are the number of nodes in the FAs for x and z,
respectively). A similar procedure can be used to construct an NFA for x, given FAs for
y and z.


Fig. 2. Given FAs for x (upper left) and z (upper right), find an FA for y such that z is the concatenation
of x and y. First, traverse FAs for z and x in parallel, constructing the cross-product FA (lower left).
Then, identify states that are accept states for x and mark the corresponding states in the FA for z
(shaded circles). Construct a new NFA (lower right) for y by copying the FA for z and making marked
nodes start nodes.
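
The construction of the domain of y from the domains of x and z can be sketched as follows. This is an illustrative simplification rather than our library code: it assumes FAs without $\epsilon$-transitions and with single-character labels, it lets an NFA have a set of start states, and the names NFA, suffix_domain and accepts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NFA:
    starts: set    # set of start states (the result may have several)
    accepts: set   # set of accept states
    trans: dict    # (state, symbol) -> set of successor states


def suffix_domain(x, z):
    """Domain of y in concat(z, x, y): traverse x and z in parallel and mark z-states."""
    frontier = [(s, t) for s in x.starts for t in z.starts]
    seen, marked = set(frontier), set()
    while frontier:
        s, t = frontier.pop()
        if s in x.accepts:
            marked.add(t)  # z can be in state t after reading a string accepted by x
        for (src, sym), dests in x.trans.items():
            if src != s:
                continue
            for t2 in z.trans.get((t, sym), ()):
                for s2 in dests:
                    if (s2, t2) not in seen:
                        seen.add((s2, t2))
                        frontier.append((s2, t2))
    # Copy the FA for z; the marked states become the start states.
    return NFA(starts=marked, accepts=set(z.accepts), trans=dict(z.trans))


def accepts(nfa, text):
    current = set(nfa.starts)
    for ch in text:
        current = {t for s in current for t in nfa.trans.get((s, ch), ())}
    return bool(current & nfa.accepts)


# x accepts {"a", "ab"}; z accepts {"abc", "ad"}.
X = NFA({0}, {1, 2}, {(0, "a"): {1}, (1, "b"): {2}})
Z = NFA({0}, {3}, {(0, "a"): {1}, (1, "b"): {2}, (2, "c"): {3}, (1, "d"): {3}})
Y = suffix_domain(X, Z)
print(sorted(w for w in ["c", "bc", "d", "abc", ""] if accepts(Y, w)))
```

In this example the marked states of z are those reached after reading “a” or “ab”, so the resulting NFA accepts the remainders “bc”, “d” and “c”.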


4.3 Containment

The relation

contains(String a, String b)

means that string b is a substring of a. If the domain of b is a regular language r, then the
domain of a is simply the regular expression “.*r.*”. Given an FA for r, we can create
an FA for “.*r.*” by adding new start and accept states that have self-loops on any character,
so that each accepts any string (“.*”), and connecting them to the original start and accept
states using $\epsilon$-transitions (Figure 3). If we have some other FA representing the domain
of a, we simply intersect that domain with the domain for “.*r.*”.




Fig. 3. Given an FA for a regular language r, construct a new FA for strings that contain 
strings in r. 


Less obviously, if the domain of a is regular, then so is the domain of b. Given an
FA for a, we can construct an NFA for b by eliminating any dead-end nodes from a (that
is, nodes from which it is impossible to reach an accept node), adding a new start state
with $\epsilon$-transitions to all states, and then making all states in a accept states (Figure 4).
Again, we simply intersect this domain with the original domain for b to enforce the
constraint.




Fig. 4. Given an FA for a regular language r, construct a new FA for all substrings of strings in r.
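
A minimal sketch of the substring construction is shown below, under some simplifying assumptions that are ours for the example: the FA has no $\epsilon$-transitions, a set of start states stands in for the new start state with $\epsilon$-transitions, and states unreachable from the start are trimmed along with the dead-end states. The function names are hypothetical.

```python
def reachable(sources, edges):
    """States reachable from `sources` in the graph `edges` (state -> successors)."""
    seen, stack = set(sources), list(sources)
    while stack:
        s = stack.pop()
        for t in edges.get(s, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def substring_nfa(starts, accept_states, trans):
    """Return (starts, accepts, trans) of an NFA for all substrings of the language."""
    fwd, bwd = {}, {}
    for (s, _sym), dests in trans.items():
        for t in dests:
            fwd.setdefault(s, set()).add(t)
            bwd.setdefault(t, set()).add(s)
    # Keep states that lie on some path from a start state to an accept state.
    live = reachable(starts, fwd) & reachable(accept_states, bwd)
    new_trans = {
        (s, sym): dests & live
        for (s, sym), dests in trans.items()
        if s in live and dests & live
    }
    return set(live), set(live), new_trans  # every live state both starts and accepts


def accepts(starts, accept_states, trans, text):
    current = set(starts)
    for ch in text:
        current = {t for s in current for t in trans.get((s, ch), ())}
    return bool(current & accept_states)


# FA for {"abc"} with an extra dead-end branch 1 -x-> 4.
TRANS = {(0, "a"): {1}, (1, "b"): {2}, (2, "c"): {3}, (1, "x"): {4}}
SUB = substring_nfa({0}, {3}, TRANS)
print([w for w in ["", "bc", "ab", "ac", "ax"] if accepts(*SUB, w)])
```

Removing the dead-end state first is what keeps “ax” out of the result: without that step, the prefix “ax” of a string that can never be completed would wrongly count as a substring.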


4.4 Length 

Constraints on the length of a string can also be represented using FAs: 



[Figure: example FAs for length constraints, including length >= 5 and length <= 5]


As the bottom two examples show, intervals over the length are simple to represent; if
we have a constraint of the form length(s,n), and the domain of n is represented as a
finite interval, we can enforce the constraint without waiting until s becomes singleton.
We simply construct a linear FA whose size is one plus the upper bound of n, and label
all of the states whose position exceeds the lower bound as accept states. Similarly, if
$d(n) = [x, \infty)$, we construct a linear FA of size $x + 1$ and make the last state an accept
state with a self-transition.

Conversely, if we have a regular domain representation of s, we can obtain lower
and upper bounds for n by determining the shortest and longest paths from the start
state to an accept state, a linear-time operation. If there is no upper limit on the size,
there will be a loop along a path to an accept state.
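
The computation of length bounds from a regular domain can be sketched as below. The encoding (a dictionary from each state to its successor states, with labels ignored) and the name length_bounds are assumptions for illustration; the sketch uses breadth-first search for the shortest path and a memoized depth-first search for the longest, returning math.inf when a loop lies on a path to an accept state.

```python
import math
from collections import deque


def length_bounds(start, accept_states, edges):
    """Return (shortest, longest) accepted length, or None if the language is empty."""
    rev = {}
    for s, dests in edges.items():
        for t in dests:
            rev.setdefault(t, set()).add(s)
    # live = states from which an accept state can be reached (not dead ends)
    live, stack = set(accept_states), list(accept_states)
    while stack:
        t = stack.pop()
        for s in rev.get(t, ()):
            if s not in live:
                live.add(s)
                stack.append(s)
    if start not in live:
        return None
    # Shortest path: breadth-first search over live states.
    dist, queue = {start: 0}, deque([start])
    while queue:
        s = queue.popleft()
        for t in edges.get(s, ()):
            if t in live and t not in dist:
                dist[t] = dist[s] + 1
                queue.append(t)
    shortest = min(dist[a] for a in accept_states if a in dist)
    # Longest path: depth-first search; revisiting a state on the current path is a loop.
    memo, on_path = {}, set()

    def walk(s):
        if s in on_path:
            return math.inf
        if s in memo:
            return memo[s]
        on_path.add(s)
        best = 0 if s in accept_states else -math.inf
        for t in edges.get(s, ()):
            if t in live:
                best = max(best, 1 + walk(t))
        on_path.discard(s)
        memo[s] = best
        return best

    return shortest, walk(start)


# Linear FA for lengths in [2, 4], and an FA for a+ (unbounded length).
print(length_bounds(0, {2, 3, 4}, {0: {1}, 1: {2}, 2: {3}, 3: {4}}))
print(length_bounds(0, {1}, {0: {1}, 1: {1}}))
```

Restricting both searches to live states matters for the upper bound: a loop among dead-end states does not make the accepted strings any longer.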

4.5 Other constraints 

Many other string constraints are straightforward to represent. To reverse all strings in
a regular domain, we simply reverse the direction of all the transitions and reverse the
status of start and accept states in the FA. To substitute one character for another, we
could perform the substitution on the labels of the transitions. Subsequences of strings
could be obtained using a combination of concat and length. For example, to specify
the 5-character prefix p of string s, we can write length(p,5) ∧ concat(s,p,r), where r is
an unconstrained string.
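
Reversal, for instance, amounts to only a few lines. The sketch below assumes an FA without $\epsilon$-transitions whose start states are given as a set, since the reversed FA may have several; the function names are hypothetical.

```python
def reverse_nfa(starts, accept_states, trans):
    """Flip every transition and swap the roles of start and accept states."""
    rev = {}
    for (src, sym), dests in trans.items():
        for dst in dests:
            rev.setdefault((dst, sym), set()).add(src)
    return set(accept_states), set(starts), rev


def accepts(starts, accept_states, trans, text):
    current = set(starts)
    for ch in text:
        current = {t for s in current for t in trans.get((s, ch), ())}
    return bool(current & accept_states)


# FA for {"ab", "ac"}; its reversal should accept {"ba", "ca"}.
FORWARD = ({0}, {2}, {(0, "a"): {1}, (1, "b"): {2}, (1, "c"): {2}})
BACKWARD = reverse_nfa(*FORWARD)
print(accepts(*BACKWARD, "ba"), accepts(*BACKWARD, "ab"))
```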

Another common operation on strings is to specify the character at a given location
of the string: characterAt(s,n,c), where c is the character at position n of string s. We
will assume that n is a constant. The case where n is a variable can be handled in a
similar fashion, but is more complex. We apply the same general idea as the length
constraint. In fact, for the character at position n in a string to have any value at all,
the string must have at least n characters, so we start with the FA for the
constraint length >= n, with the addition that the label of the transition leading to the
accept state is restricted to the domain of c.

Given the domain of s, we could similarly determine the domain of c in $O(n(|S| + |T|))$,
by finding all states reachable in $n - 1$ transitions from the start state, then taking
the union of the labels of transitions from which it is possible to reach an accept state.

Of the constraints we have discussed, matches, concat, contains and reverse are 
implemented in our constraint library. Implementation of the others is left as future 
work. 

5 Domain operations 

In order to effectively eliminate inconsistent values from regular domains during
constraint propagation, we need to be able to perform set operations on the domains,
including intersecting two domains, determining whether one is a subset of another, and
determining whether a domain is empty or singleton. We can perform these operations
easily using FAs. It is well known that regular languages are closed under intersection, 
union and negation, and the algorithms for performing these operations on FAs are both 
straightforward and widely known, so we will not repeat them here, but we illustrate 
them graphically as a reminder. 




[Figure: graphical illustration of intersection, union and negation of FAs]


Of these set operations, intersection is used frequently in constraint propagation and
negation is useful for domain subtraction, subset tests and other operations, but union
is not a common set operation for domains. Superficially, it may seem that intersection
is the most expensive operation, since it potentially generates the cross-product of its
inputs, whereas union and negation are linear. However, union
produces an NFA and negation requires a DFA. Converting an NFA to a DFA potentially
generates the power set of the NFA, an exponential blowup.

Given these operations we can apply the following definitions to compute subset 
and equality relations between two domains: 
$(fa_1 \subseteq fa_2) \equiv (\neg fa_2 \cap fa_1 = \emptyset)$

$(fa_1 = fa_2) \equiv (fa_2 \subseteq fa_1) \wedge (fa_1 \subseteq fa_2)$
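
These two definitions can be sketched directly for DFAs. The code below is an illustration under our own assumptions: each DFA is a triple (start, accepts, delta) with a partial transition function, a missing transition stands for an implicit rejecting sink state (represented by None), and the names is_subset and is_equal are hypothetical. Rather than building the complement and the intersection explicitly, it explores the product on the fly and looks for a state pair that witnesses a string in $fa_1$ but not in $fa_2$.

```python
def is_subset(dfa1, dfa2):
    """True if L(dfa1) is a subset of L(dfa2); dfa = (start, accepts, delta)."""
    start1, acc1, d1 = dfa1
    start2, acc2, d2 = dfa2
    seen = {(start1, start2)}
    stack = [(start1, start2)]
    while stack:
        p, q = stack.pop()
        if p in acc1 and q not in acc2:
            return False  # some string is accepted by dfa1 and rejected by dfa2
        for (src, sym), p2 in d1.items():
            if src != p:
                continue
            q2 = d2.get((q, sym))  # None models the rejecting sink of dfa2
            if (p2, q2) not in seen:
                seen.add((p2, q2))
                stack.append((p2, q2))
    return True


def is_equal(dfa1, dfa2):
    return is_subset(dfa1, dfa2) and is_subset(dfa2, dfa1)


A_PLUS = (0, {1}, {(0, "a"): 1, (1, "a"): 1})   # a+
A_STAR = (0, {0}, {(0, "a"): 0})                # a*
print(is_subset(A_PLUS, A_STAR), is_subset(A_STAR, A_PLUS))
```

The early return avoids constructing the full product when a counterexample exists, although the worst case is still proportional to the product of the two state sets.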


5.1 Domain Size 


It is important to be able to determine the size of a domain. For example, if the size is
0 (empty), then the constraint network is inconsistent. If the size is 1, then a value for 
the corresponding variable is determined. If the size is small and finite, then it may be 
appropriate to explicitly select a value in a search for a solution, but if the size is infinite, 
then such a search may never terminate. Determining the size of a regular domain is less 
straightforward than determining the size of a set or interval domain, but it can still be 
done fairly efficiently. 

Given an FA, we can determine the number of strings in the language as follows. We 
begin by removing all dead-end states from the FA, a linear-time operation. A dead-end 
state is a state from which it is impossible to reach an accept state. Once the dead- 
end states are removed, if the FA contains any loops, then there are infinitely many 
solutions, because we can follow a loop any number of times and then follow a path to 
an accept state. We perform a topological sort of the FA, an operation that is linear in the 
number of arcs. If the sort fails, then there is a loop and thus infinitely many solutions. 
Otherwise, we traverse the graph in the order dictated by the topological sort, keeping 
track of the number of paths there are from the initial state to the current state: 

size(<S, T>)
    sort S topologically
    pathsFromInit[0] = 1
    numSolutions = 1 if isFinal(S[0]), 0 otherwise
    for i = 0 to |S| - 1
        foreach transition <n_i -> n_d> in T starting from n_i
            if isFinal(n_d)
                numSolutions += pathsFromInit[i]
            pathsFromInit[d] += pathsFromInit[i]
    return numSolutions
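
A runnable sketch of this procedure follows. It is not our implementation: it assumes a DFA whose transitions each carry a single character, so that distinct paths correspond to distinct strings, and the names language_size and _closure are made up for the example. The topological sort is done with Kahn's algorithm, and accepted strings are counted when a state is visited rather than on each incoming transition, which gives the same total.

```python
import math
from collections import deque


def _closure(sources, edges):
    seen, stack = set(sources), list(sources)
    while stack:
        s = stack.pop()
        for t in edges.get(s, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def language_size(start, accept_states, delta):
    """Number of accepted strings; delta maps (state, symbol) to one state."""
    succ, pred = {}, {}
    for (s, _sym), t in delta.items():
        succ.setdefault(s, []).append(t)  # one entry per labelled transition
        pred.setdefault(t, []).append(s)
    # Drop dead-end states and states unreachable from the start.
    live = _closure({start}, succ) & _closure(accept_states, pred)
    if start not in live:
        return 0
    indeg = {s: 0 for s in live}
    for s in live:
        for t in succ.get(s, ()):
            if t in live:
                indeg[t] += 1
    paths = {s: 0 for s in live}
    paths[start] = 1
    queue = deque(s for s in live if indeg[s] == 0)
    total, visited = 0, 0
    while queue:  # Kahn's algorithm: visit states in topological order
        s = queue.popleft()
        visited += 1
        if s in accept_states:
            total += paths[s]
        for t in succ.get(s, ()):
            if t in live:
                paths[t] += paths[s]
                indeg[t] -= 1
                if indeg[t] == 0:
                    queue.append(t)
    # If the sort could not visit every live state, there is a loop.
    return total if visited == len(live) else math.inf


# DFA for (a|b)c? accepts a, b, ac, bc.
print(language_size(0, {1, 2}, {(0, "a"): 1, (0, "b"): 1, (1, "c"): 2}))
```

For transitions labelled with character sets, each transition would have to be weighted by the number of characters in its label, and for an NFA the path count can exceed the number of distinct strings.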


6 Complexity 

All of the set operations and string constraints we have discussed are either linear or
quadratic in the size of the FAs representing the string domains. However, many
operations, such as union, produce NFAs as outputs, and some, such as negation, require
DFAs as inputs. As noted, converting an NFA to a DFA may result in exponential
blowup in the size of the FA. Furthermore, even when every operation on the FA
results in a polynomially larger FA, that can still mean exponential growth in the number
of operations, i.e., the number of constraints that contain the variable whose domain is
represented by the FA. Ultimately, how the FA grows will depend on the nature of the
problem at hand. The FA representation can be viewed as a compression of the full sets
of strings. It will tend to do well at compressing sets with a lot of symmetry and simple
structure, but will not do so well at compressing arbitrary lists of strings, where there
is little or no structure to exploit. In the latter cases, the representation will blow up,
converging toward an explicit list of the members. The exponential blowup in the
representation can be viewed as a failure in the exponential reduction that FAs are capable
of providing.

Using regular domains is worth considering in problems in which one of the fol- 
lowing holds: 

1. There is a great amount of symmetry or the domain is highly under-constrained. 
In this case, the benefit of a precise domain representation should outweigh the 
negligible cost in time and storage. 

2. It is necessary to explicitly consider all possible domain values or solutions to the
CSP. In this case, the domain will have to be enumerated one way or the other. In
the worst case, a minimized FA requires space that is linear in the size of the list of
strings and could be arbitrarily better.

3. There are constraints over strings of unbounded length. In this case, the domain is 
infinite. The only alternative to regular languages that we know of for representing 
infinite sets of strings is to represent every infinite domain as the full domain (the 
set of all strings). With regular domains, we can enforce generalized arc consistency 
over infinite sets of strings, making it possible to solve problems that could not be 
solved otherwise. 

7 Examples 

7.1 Pathname 

In Unix, sets of files are often represented using regular expressions on their pathnames. 
Correspondingly, regular domains are very useful for representing sets of files in a 
constraint-based planning problem. In addition to the ability to represent large sets con- 
cisely, we can also handle constraints that relate the file's pathname to other attributes 
of the file. For example, satellite images and other automatically generated data are typ- 
ically stored in ordinary filesystems, with pathnames based on details of the data, such 
as the time, subject, source, file format, etc. Suppose we have a remote archive in which 
satellite images have pathnames of the form: 

/downlink/<year>/<dayOfYear>/<sensor><gridx><gridy>.<format>

We can represent this knowledge using a concatenation constraint: 

rpn = concat(“/downlink/”, y, “/”, d, “/”, s, gx, gy, “.”, fmt).



Given only this knowledge, all we know about rpn is that the set of files is characterized
by the regular expression “/downlink/.*/.*/.*\..*”. However, we typically know
quite a bit about the other variables. We know how many years the satellite has been in
operation, how many days are in a year, the sensors aboard the satellite, the grid
coordinate system used to indicate the regions covered by the images, and the available
formats. Assuming we are interested in just a subset of the data, we can impose additional
constraints on these variables to specify just the files we are interested in. For example,
if we want MOD17 data from January 27, 2002 in either HDF or binary format, then the
domain of rpn is “/downlink/2002/27/MOD17[0-9][0-9][0-9][0-9]\.(hdf|bin)”.

String constraints are not just useful for specifying sets of files, but also for specifying
the effects of file operations. Since the files are on a remote server, we can't access them
directly, but we can copy them to a local disk. Suppose we executed the command
scp -r server:/downlink/2002 local02 to copy the contents of the directory 2002 to
the directory local02. We can describe the effect on the pathnames of the resulting
files using the pair of constraints:

1. concat(rpn, “/downlink/2002/”, ldir)

2. concat(lpn, “local02/”, ldir)

Since the concat constraint can be used to derive the domain of any variable, given the 
domains of the other two variables, and since we know that the domain of rpn (limited 
to the files we care about) is 

/downlink/2002/27/MOD17[0-9][0-9][0-9][0-9]\.(hdf|bin)

we can enforce the first constraint to obtain the domain of ldir:

27/MOD17[0-9][0-9][0-9][0-9]\.(hdf|bin)

We can then apply the second constraint to obtain the domain of lpn:

local02/27/MOD17[0-9][0-9][0-9][0-9]\.(hdf|bin)

If, after copying the files, we discovered that there are only HDF files, we could apply 
the same constraints in the other direction to conclude that there were no binary files on 
the server. 


7.2 Crossword Puzzle 

Another application of string constraints is to the crossword puzzle problem. Solving 
crossword puzzles is a very popular pastime and also a well-studied problem in com- 
puter science. The full problem of solving crossword puzzles, given only the puzzle 
layout and a list of clues, is a hard problem that involves many aspects of AI [7,23]. 
A more commonly addressed simplification of the problem, in which a list of possible 
words is given instead of clues, is more akin to creating crossword puzzles than solving
them. This problem becomes a classic constraint satisfaction problem, where the
variables of the constraint problem are word slots on the puzzle board in which words
can be written, the domains of variables are available words, and the binary constraints
on variables enforce the agreement of letters at intersections between slots. Solving a
crossword puzzle reduces to finding a solution to the constraint problem: an assignment
of values to the variables such that each variable is assigned a value in its domain and
no constraint is violated.

We can use string constraints to formalize the crossword puzzle problem. There is
a variable for each slot, each intersection point and each contiguous segment of text
within a slot that does not cross an intersection. The variables for word slots take values
from all available words, the variables for intersection points take values of letters from
the alphabet, and the variables for segments take values of unknown strings of fixed length.
Each word slot is constrained to be the concatenation of the segments and intersection
points that it contains.

For example, suppose that we have the following crossword puzzle, which is taken
from http://yoda.cis.temple.edu:8080/UGAIWWW/lectures95/search/puzzle.html:



The list of words:

AFT      LASER
ALE      LEE
EEL      LINE
HEEL     SAILS
HIKE     SHEET
HOSES    STEER
KEEL     TIE
KNOT


To formalize this puzzle as a CSP with string constraints, we have 

- 8 variables for the word slots, marked from $x_1$ to $x_8$

- 12 variables for the intersection points, marked as $c_i$

- 9 variables for the segments, marked as $b_i$

We have 8 constraints as follows: 

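1. concat($x_1$, $b_1$, $c_1$, $b_2$, $c_2$)

2. concat($x_2$, $c_1$, $b_3$, $c_3$, $c_6$, $c_{10}$)

3. concat($x_3$, $c_2$, $b_4$, $c_5$, $c_8$, $c_{12}$)

4. concat($x_4$, $b_5$, $c_3$, $c_4$, $c_5$)

5. concat($x_5$, $c_4$, $c_7$, $c_{11}$, $b_6$)

6. concat($x_6$, $b_7$, $c_9$, $b_8$)

7. concat($x_7$, $c_6$, $c_7$, $c_8$)

8. concat($x_8$, $c_9$, $b_9$, $c_{10}$, $c_{11}$, $c_{12}$)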

It is worth noting that, compared to the traditional CSP formalization, we may have
many additional variables introduced to the formalized crossword puzzle problem, but
only the $x_i$ variables, that is, those variables representing word slots, need to be searched
during the CSP solving. Other variables will be assigned values by propagation. In fact,
with the constraint system we implemented to support a constraint-based planner, we
can solve the above crossword puzzle example without backtracking.
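
The formalization can be tried out with a small stand-alone sketch. It is not our constraint system: it uses plain backtracking over the word slots instead of propagation over regular domains. Two things are assumptions rather than facts from the text: the segment lengths (taken here to be 2 for $b_1$ and 1 for every other segment) and the reading of the last argument of constraint 3 as $c_{12}$, which is what makes every intersection variable appear in exactly two slots. All names are hypothetical.

```python
WORDS = ["AFT", "ALE", "EEL", "HEEL", "HIKE", "HOSES", "KEEL", "KNOT",
         "LASER", "LEE", "LINE", "SAILS", "SHEET", "STEER", "TIE"]

# The eight concat constraints: each slot is the concatenation of its parts.
SLOTS = {
    "x1": ["b1", "c1", "b2", "c2"],
    "x2": ["c1", "b3", "c3", "c6", "c10"],
    "x3": ["c2", "b4", "c5", "c8", "c12"],
    "x4": ["b5", "c3", "c4", "c5"],
    "x5": ["c4", "c7", "c11", "b6"],
    "x6": ["b7", "c9", "b8"],
    "x7": ["c6", "c7", "c8"],
    "x8": ["c9", "b9", "c10", "c11", "c12"],
}

# ASSUMPTION: segment lengths are not stated in the text; b1 is taken to be
# two letters long and every other part (segment or intersection) one letter.
SEGMENT_LENGTH = {"b1": 2}


def part_length(part):
    return SEGMENT_LENGTH.get(part, 1)


def split(word, parts):
    """Cut `word` into the given parts, or return None if the lengths do not fit."""
    if len(word) != sum(part_length(p) for p in parts):
        return None
    pieces, pos = {}, 0
    for p in parts:
        pieces[p] = word[pos:pos + part_length(p)]
        pos += part_length(p)
    return pieces


def solve(slots, fixed):
    """Yield assignments of words to slots consistent with the parts fixed so far."""
    if not slots:
        yield {}
        return
    slot, rest = slots[0], slots[1:]
    for word in WORDS:
        pieces = split(word, SLOTS[slot])
        if pieces is None:
            continue
        if any(fixed.get(p, v) != v for p, v in pieces.items()):
            continue  # disagrees with a letter shared with an earlier slot
        for tail in solve(rest, {**fixed, **pieces}):
            yield {slot: word, **tail}


solutions = list(solve(list(SLOTS), {}))
print(solutions)
```

Under these assumptions the search only ever branches on the slot variables; the intersection and segment variables are determined as soon as the slots that contain them are assigned.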




7.3 Bioinformatics 


Constraint techniques have been applied to bioinformatics. For example, the authors in 
[14] reported their work on applying a constraint-based approach to determining protein 
structures. The problem of determining protein structures is modelled as a constraint 
problem where variables are the Cartesian coordinates of each atom in the protein,
and constraints are restrictions on these coordinates. In [20] an integer programming 
(IP) approach, which can be seen as a special case of constraint formulation, is applied 
to solving sequence alignment and protein threading problems in genetics. Numerical 
constraints are also applied to genome mapping [21] and protein structure prediction
[2].

It is possible that string constraints could play an important role in applying con- 
straint based approaches to bioinformatics. DNA, RNA and proteins can be represented 
as strings. In the case of DNA, the letters are the familiar nucleotides A, G, C and T. In 
the case of proteins, the letters are the 20 amino acids. Many problems in bioinformatics 
involve matching DNA sequences against a database, a classic textual search problem in 
which regular expressions are commonly used. Other problems, such as reconstructing 
chromosomes from short DNA fragments (or clones), can be formalized as constraint 
satisfaction problems or constrained optimization problems using string constraints, and 
could be solved using advanced constraint satisfaction algorithms. However, this is left
as future work.

8 Conclusions 

We have discussed an approach to constraint reasoning over strings in which regular 
languages are used to represent and reason about infinite sets of strings. Regular lan- 
guages have a number of qualities to recommend them as a domain representation. 

- They are closed under intersection, union and negation.

- They can concisely represent infinite sets of strings.

- Many natural string constraints, such as concatenation, containment and length, can
be represented in terms of operations on regular languages.

- They are widely used and well understood.

These advantages do come at a price; it can be substantially more costly to represent 
and reason about regular languages than, say, intervals. On the other hand, the time
and space complexity of constraint reasoning with regular languages can be literally 
infinitely less than that of reasoning over explicit sets of strings. 